Summary. Probability Collectives (PC) provides the information-theoretic extension
of conventional full-rationality game theory to bounded rational games. Here
an explicit solution to the equations giving the bounded rationality equilibrium of 
a game is presented. Then PC is used to investigate games in which the players use 
bounded rational best-response strategies. Next it is shown that in the continuum- 
time limit, bounded rational best response games result in a variant of the replicator 
dynamics of evolutionary game theory. It is then shown that for team (shared-payoff) 
games, this variant of replicator dynamics is identical to Newton-Raphson iterative 
optimization of the shared utility function. 


1 Introduction 

Recent work has used information theory [9, 12] to provide a principled extension
of conventional noncooperative game theory to accommodate bounded
rationality [25, 27]. Intuitively, this extension starts with the observation that
in the real world ascertaining a game’s equilibrium is an exercise in statistical 
inference: one is given (or assumes) partial information about the behavior of 
the players, and from that infers (!) what the joint mixed strategy is likely to 
be. There are many ways to do such statistical inference. The one investigated 
in [27] is based on information theory’s version of Occam’s razor: Predict the 
joint mixed strategy that has as little extra information as possible beyond 
the provided partial knowledge while being consistent with that knowledge. 
This version of Occam’s razor is known as the Maximum entropy (Maxent) 
principle [9, 12]. It tells us that the mixed strategy of a game’s equilibrium,
$q(x \in X) = \prod_i q_i(x_i)$, is the solution to a coupled set of Lagrangian functions
that are specified by the game structure and the provided partial knowledge. 

Sec. 2 reviews how information theory can be used to derive bounded 
rational noncooperative game theory. Some simple examples of the bounded 
rational equilibrium solutions of games are then presented. Sec. 3 analyzes 
scenarios in which the players use bounded rational versions of best response
strategies. Particular attention is paid to team games, in which the players
share the same utility function. The analysis for this case provides insight into
how to optimize the sequence of moves by the players, as far as their shared 
utility is concerned. This can be viewed as a formal way to optimize the 
organization chart of a corporation. 

Best response strategies, even bounded rational ones, are poor models of 
real-world computational players that use Reinforcement Learning (RL) [20]. 
Sec. 4 considers iterated games in which players use a (bounded rational) 
variant of best response, a variant that is more realistic for computational 
players, and arguably for human players as well. In this variant the conditional
expected utility used by player $i$ to update her strategy (her expected payoff
given move $x_i$) is a decaying average of recent conditional expected utilities.
This decay biases the player to dampen large and sudden changes in her strategy.
This variant is then explored for the case of team games. The continuum-time
limit of the dynamics of such games is shown to be a variant of the replicator
dynamics. It is also shown that such continuum-limit bounded rational best
response is identical to Newton-Raphson iterative optimization of the shared
utility function of such games.

The formalism presented in this paper is a special case of the field of 
Probability Collectives (PC), a case in which the joint distribution over the 
variables of interest is a product distribution. This special case is known as 
Product Distribution (PD) theory [25, 27, 29, 28, 26, 7]. PC has many appli- 
cations beyond those considered in this paper, e.g., distributed optimization 
and control [16, 15, 2, 29]. Finally, see [16] for relations to other work in game 
theory, optimization, statistical physics, and reinforcement learning. 


2 Bounded Rational Noncooperative Game Theory 

In this section we motivate PD theory as the information-theoretic formulation
of bounded rational game theory. We use the integral sign ($\int$) with the
associated measure implicit, i.e., it indicates sums if appropriate, Lebesgue
integrals over $\mathbb{R}^n$ if appropriate, etc. In addition, the subscript $(i)$ is used to
indicate all index values other than $i$. Finally, we use $\mathcal{P}$ to indicate the set of
all probability distributions over a vector space, and $\mathcal{Q}$ to indicate the subset
of $\mathcal{P}$ consisting of all product distributions (i.e., the associated Cartesian
product of unit simplices).

In noncooperative game theory one has a set of $N$ players. Each player $i$
has its own set of allowed pure strategies. A mixed strategy is a distribution
$q_i(x_i)$ over player $i$’s possible pure strategies. Each player $i$ also has a
private utility function $g_i$ that maps the pure strategies adopted by all $N$ of
the players into the real numbers. So given mixed strategies of all the players,
the expected utility of player $i$ is $E(g_i) = \int dx \prod_j q_j(x_j) g_i(x)$.

In a Nash equilibrium every player adopts the mixed strategy that maximizes
its expected utility, given the mixed strategies of the other players. More
formally, $\forall i$, $q_i = \mathrm{argmax}_{q'_i} \int dx\, q'_i(x_i) \prod_{j \ne i} q_j(x_j) g_i(x)$. Perhaps the major
objection that has been raised to the Nash equilibrium concept is its assumption
of full rationality [10, 6, 18, 4]. This is the assumption that every player $i$
can both calculate what the strategies $q_{j \ne i}$ will be and then calculate its
associated optimal distribution. In other words, it is the assumption that every
player will calculate the entire joint distribution $q(x) = \prod_j q_j(x_j)$.

In the real world, this assumption of full rationality almost never holds, 
whether the players are humans, animals, or computational agents [5, 17, 
10, 3, 8, 1, 22, 14]. This is due to the cost of computation of that optimal 
distribution, if nothing else. This real-world bounded rationality is a major 
impediment to applying conventional game theory in the real world. 


2.1 Review of the minimum information principle 

Shannon was the first person to realize that based on any of several separate
sets of very simple desiderata, there is a unique real-valued quantification of
the amount of syntactic information in a distribution $P(y)$. He showed that
this amount of information is the negative of the Shannon entropy of that
distribution, $S(P) = -\int dy\, P(y) \ln\left[\frac{P(y)}{\mu(y)}\right]$, with $\mu$ a reference measure
taken to be uniform here. So for example, the distribution
with minimal information is the one that doesn’t distinguish at all between
the various $y$, i.e., the uniform distribution. Conversely, the most informative
distribution is the one that specifies a single possible $y$. Note that for a product
distribution, entropy is additive, i.e., $S(\prod_i q_i(y_i)) = \sum_i S(q_i)$.
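The additivity of entropy for product distributions, and the two extreme cases mentioned above, can be checked numerically. The following is a minimal sketch, assuming finite move sets and a uniform reference measure (so the entropy reduces to $-\sum_y P(y) \ln P(y)$); the marginals and the helper names `entropy` and `joint_of_product` are illustrative choices, not taken from the paper.

```python
import math
from itertools import product

def entropy(p):
    """Shannon entropy in nats of a discrete distribution given as a list."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def joint_of_product(marginals):
    """Flatten the product distribution prod_i q_i(y_i) into a single list."""
    return [math.prod(ps) for ps in product(*marginals)]

# Arbitrary illustrative marginals (assumption: any normalized lists would do).
q1 = [0.5, 0.25, 0.25]
q2 = [2 / 3, 0.25, 1 / 12]

s_joint = entropy(joint_of_product([q1, q2]))  # entropy of the joint
s_sum = entropy(q1) + entropy(q2)              # sum of marginal entropies

uniform = [1 / 3] * 3        # least informative: maximal entropy ln(3)
delta = [1.0, 0.0, 0.0]      # most informative: zero entropy
```

Here `s_joint` and `s_sum` agree up to floating-point rounding, `entropy(uniform)` equals $\ln 3$, and `entropy(delta)` is zero.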

Say we are given some incomplete prior knowledge about a distribution $P(y)$.
How should one estimate $P(y)$ based on that prior knowledge? Shannon’s result
tells us how to do that in the most conservative way: have your estimate
of $P(y)$ contain the minimal amount of extra information beyond that already
contained in the prior knowledge about $P(y)$. Intuitively, this can be viewed
as a version of Occam’s razor: in inferring $P$, introduce as little extra
information as possible beyond what you are provided. This minimum information
approach is called the maxent principle. It has proven extremely powerful in
domains ranging from signal processing to supervised learning [12]. In particular,
it has been successfully used in many statistics applications, including
econometrics [13]. It has even provided what many consider the cleanest
derivation of the foundations of statistical physics [11].


2.2 Maxent Lagrangians 

Much of the work on equilibrium concepts in game theory adopts the per- 
spective of an external observer of a game. We are told something concerning 
the game, e.g., its cost functions, information sets, etc., and from that wish to 
predict what joint strategy will be followed by real-world players of the game. 
Say that in addition to such information, we are told the expected utilities 
of the players. What is our best estimate of the distribution q that generated
those expected cost values? By the maxent principle, it is the distribution
with maximal entropy, subject to those expectation values.

To formalize this, for simplicity assume a finite number of players and of 
possible strategies for each player. To agree with the convention in fields other 
than game theory (e.g., optimization, statistical physics, etc.), from now on 
we implicitly flip the sign of each gi so that the associated player i wants to 
minimize that function rather than maximize it. Intuitively, this flipped $g_i(x)$
is the “cost” to player $i$ when the joint-strategy is $x$.

With this convention, given prior knowledge that the expected utilities of
the players are given by the set of values $\{\epsilon_i\}$, the maxent estimate of the
associated $q$ is given by the minimizer of the Lagrangian

$$\begin{aligned}
\mathscr{L}(q) &= \sum_i \beta_i [E_q(g_i) - \epsilon_i] - S(q) \qquad (1) \\
&= \sum_i \beta_i \Big[ \int dx \prod_j q_j(x_j) g_i(x) - \epsilon_i \Big] - S(q) \qquad (2)
\end{aligned}$$

where the subscript on the expectation value indicates that it is evaluated under
distribution $q$. The $\{\beta_i\}$ are “inverse temperatures” implicitly set by the
constraints on the expected utilities.

Solving, we get the coupled equations

$$q_i(x_i) \propto e^{-E_{q_{(i)}}(G \mid x_i)} \qquad (3)$$

where the overall proportionality constant for each $i$ is set by normalization,
and $G = \sum_i \beta_i g_i$. In Eq. 3 the probability of player $i$ choosing pure strategy
$x_i$ depends on the effect of that choice on the utilities of the other players.
This reflects the fact that our prior knowledge concerns all the players equally.

If we wish to focus only on the behavior of player $i$, it is appropriate to
modify our prior knowledge. First consider the case of maximal prior knowledge,
in which we know the actual joint-strategy of the players, and therefore
all of their expected costs. For this case, trivially, the maxent principle says
we should “estimate” $q$ as that joint-strategy (it being the $q$ with maximal
entropy that is consistent with our prior knowledge). The same conclusion
holds if our prior knowledge also includes the expected cost of player $i$.

Modify this maximal set of prior knowledge by removing from it specification
of player $i$’s strategy. So our prior knowledge is the mixed strategies of all
players other than $i$, together with player $i$’s expected cost. We can incorporate
prior knowledge of the other players’ mixed strategies directly, without
introducing Lagrange parameters. The resultant maxent Lagrangian is

$$\mathscr{L}_i(q_i) = \beta_i [E_q(g_i) - \epsilon_i] - S_i(q_i)$$

solved by a set of coupled Boltzmann distributions:

$$q_i(x_i) \propto e^{-\beta_i E_{q_{(i)}}(g_i \mid x_i)} \qquad (4)$$

1 The subscript $q_{(i)}$ on the expectation value indicates that it is evaluated
according to the distribution $\prod_{j \ne i} q_j$.

Following Nash, we can use Brouwer’s fixed point theorem to establish that
for any non-negative values $\{\beta_i\}$, there must exist at least one product
distribution given by the product of these Boltzmann distributions (one term in
the product for each $i$).

The first term in $\mathscr{L}_i$ is minimized by a perfectly rational player. The second
term is minimized by a perfectly irrational player, i.e., by a perfectly uniform
mixed strategy $q_i$. So $\beta_i$ in the maxent Lagrangian explicitly specifies the
balance between the rational and irrational behavior of the player. In particular,
for $\beta_i \to \infty$, by minimizing the Lagrangians we recover the Nash equilibria
of the game. More formally, in that limit the set of $q$ that simultaneously
minimize the Lagrangians is the set of mixed strategy equilibria of the game,
together with the set of delta functions about the pure Nash equilibria of the
game. The same is true for Eq. 3.

Note also that independent of information-theoretic considerations, the
Boltzmann distribution is a reasonable (highly abstracted) model of human
behavior. Typically humans do some “exploration” as well as “exploitation”, 
trying each move with probability that rises as the expected cost of the move 
falls. This is captured in the Boltzmann distribution mixed strategy. 

One can formalize the concept of the rationality of a player in a way that
applies to any distribution, not just a Boltzmann distribution. One does this
with a rationality operator which maps a $q$ and a $g_i$ to a non-negative
real value measuring the rationality of player $i$ in adopting strategy $q_i$ given
private cost function $g_i$ and strategies of the other players. For the solution
in Eq. 4 and private cost $g_i$, the value of that operator is just $\beta_i$ [27].

Eq. 3 is just a special case of Eq. 4, where all players share the same
private cost function, $G$. (Such games are known as team games.) This
relationship reflects the fact that for this case, the difference between the
maxent Lagrangian and the one in Eq. 2 is independent of $q_i$. Due to this
relationship, our guarantee of the existence of a solution to the set of maxent
Lagrangians implies the existence of a solution of the form Eq. 3. Typically
players will be closer to minimizing their expected cost than maximizing it.
For prior knowledge consistent with such a case, the $\beta_i$ are all non-negative.

For each player $i$ define $f_i(x, q_i(x_i)) = \beta_i g_i(x) + \ln[q_i(x_i)]$. Then we can
write the maxent Lagrangian for player $i$ as

$$\mathscr{L}_i(q) = \int dx\, q(x) f_i(x, q_i(x_i)). \qquad (5)$$

Now in a bounded rational game every player sets its strategy to minimize its
Lagrangian, given the strategies of the other players. In light of Eq. 5, this
means that we can interpret each player in a bounded rational game as being
perfectly rational for a cost function that incorporates its computational cost.
To do so we simply need to expand the domain of “cost functions” to include
(logarithms of) probability values as well as joint moves.


2.3 Examples of bounded rational equilibria

It can be difficult to start with a set of cost functions and associated
rationalities $\beta_i$ and then solve for the associated bounded rational equilibrium $q$.
Solving for $q$ when prior knowledge consists of expected costs $\epsilon_i$ rather than
rationalities can be even more tedious. (In that situation the $\beta_i$ are not
specified upfront but instead are Lagrange parameters that we must solve for.)
However there is an alternative approach to constructing examples of games 
and their bounded rational equilibria that is quite simple. In this alternative 
one starts with a particular mixed strategy q and then solves for a game for 
which q is a bounded rational equilibrium, rather than the other way around. 

To illustrate this, consider a 2-player single-stage game. Let each player 
have 3 possible moves, indicated by the numerals 0, 1, and 2. Say the (bounded 
rational) mixed strategy equilibrium is

$$q_1(0) = 1/2, \quad q_1(1) = 1/4, \quad q_1(2) = 1/4;$$

$$q_2(0) = 2/3, \quad q_2(1) = 1/4, \quad q_2(2) = 1/12. \qquad (6)$$

Now we know that at the equilibrium, $q_1(x_1) \propto e^{-\beta_1 E(g_1 \mid x_1)}$, where $\beta_1$
is player 1’s rationality, and $g_1$ is her cost function (the negative of her
utility function). This means for example that

$$e^{-\beta_1 [E(g_1 \mid x_1 = 0) - E(g_1 \mid x_1 = 1)]} = \frac{q_1(0)}{q_1(1)} = 2, \quad \text{i.e.,}$$

$$\beta_1 [E(g_1 \mid x_1 = 0) - E(g_1 \mid x_1 = 1)] = -\ln(2). \qquad (7)$$

A similar equation governs the remaining independent difference in expecta- 
tion values for player 1. The analogous two equations for player 2 also hold. 

Now define the vectors $\mathbf{g}_{i;j}(\cdot) = g_i(x_i = j, \cdot)$. So for example
$\mathbf{g}_{1;0} = (g_1(x_1 = 0, x_2 = 0),\ g_1(x_1 = 0, x_2 = 1),\ g_1(x_1 = 0, x_2 = 2))$. Then we can
express our equations compactly as four dot product equalities:

$$\beta_1 (\mathbf{g}_{1;0} - \mathbf{g}_{1;1}) \cdot q_2 = -\ln(2); \quad \beta_1 (\mathbf{g}_{1;0} - \mathbf{g}_{1;2}) \cdot q_2 = -\ln(2);$$

$$\beta_2 (\mathbf{g}_{2;0} - \mathbf{g}_{2;1}) \cdot q_1 = -\ln(8/3); \quad \beta_2 (\mathbf{g}_{2;0} - \mathbf{g}_{2;2}) \cdot q_1 = -\ln(8). \qquad (8)$$

We can absorb each $\beta_i$ into its associated $g_i$; all that matters is their product.

We can now plug in for the vectors $q_1$ and $q_2$ from Eq. 6 and simply write
down a set of solutions for the four three-dimensional vectors $\mathbf{g}_{i;j}$. For these
$\{g_i\}$ the bounded rational equilibrium is given by the $q$ of Eq. 6. If desired, we
can evaluate the associated expected values of the cost functions for the two
players; our $q$ is the bounded rational equilibrium for those expected costs.
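One way to write down such a solution can be sketched in code. The sketch below assumes each $\beta_i$ has been absorbed into $g_i$, picks arbitrary starting rows (the hypothetical `free_rows` arguments), and shifts every row with $x_i > 0$ by a constant so that the dot product equalities of Eq. 8 hold. The shift works because each $q_j$ sums to one. It then checks that the Boltzmann response to the other player's strategy reproduces the specified $q$.

```python
import math

# Target equilibrium (the values of Eq. 6).
q1 = [1 / 2, 1 / 4, 1 / 4]
q2 = [2 / 3, 1 / 4, 1 / 12]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_costs(q_self, q_other, free_rows):
    """Return rows g[j] = g_i(x_i = j, .), with beta_i absorbed into g_i.

    free_rows[j] is an arbitrary starting vector (an illustrative choice).
    Row 0 is kept as is; each row j > 0 is shifted by a constant so that
    g[0].q_other - g[j].q_other = ln(q_self[j] / q_self[0]).
    """
    base = dot(free_rows[0], q_other)
    rows = [list(free_rows[0])]
    for j in range(1, len(q_self)):
        target = base - math.log(q_self[j] / q_self[0])
        shift = target - dot(free_rows[j], q_other)
        rows.append([v + shift for v in free_rows[j]])
    return rows

def boltzmann(rows, q_other):
    """Boltzmann response: weight exp(-E(g | x_i = j)) for each move j."""
    w = [math.exp(-dot(r, q_other)) for r in rows]
    z = sum(w)
    return [x / z for x in w]

g1 = solve_costs(q1, q2, [[0, 1, 2], [3, 0, 1], [1, 1, 0]])
g2 = solve_costs(q2, q1, [[1, 0, 0], [0, 2, 0], [0, 0, 3]])

br1 = boltzmann(g1, q2)  # should reproduce q1
br2 = boltzmann(g2, q1)  # should reproduce q2
```

Changing the entries of `free_rows` yields a different game with the same bounded rational equilibrium, which illustrates how underconstrained the cost functions are.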

Note that the variables in the first pair of equalities in Eq. 8 are independent
of those in the second pair. In other words, whereas the Boltzmann
equations giving $q$ for a specified set of $g_i$ are a set of coupled equations, the
equations giving the $g_i$ for a specified $q$ are not coupled. Note also that our
equations for the $\mathbf{g}_{i;j}$ are (extremely) underconstrained. This illustrates how
compressive the mapping from the $g_i$ to the associated equilibrium $q$ is. Bear
in mind though that that mapping is also multi-valued in general: a single set
of cost functions can have more than one bounded rational equilibrium, just as
it can have more than one Nash equilibrium.

The generalization of this example to arbitrary numbers of players with
arbitrary move spaces is immediate. As before, indicate the moves of every
player by an associated set of integer numerals starting at 0. Recall that the
subscript $(i)$ on a vector indicates all components but the $i$’th one. Also absorb
the rationalities $\beta_i$ into the associated $g_i$.

Now specify $q$ and the vectors $g_i(x_i = 0, \cdot)$ (one vector for each $i$) to be
anything whatsoever. Then for all players $i$, the only associated constraint on
the $i$’th cost function concerns certain projections of the vectors $g_i(x_i > 0, \cdot)$
(one projection for each value $x_i > 0$). Concretely, $\forall i$, $x_i > 0$,

$$\int dx'_{(i)}\, g_i(x_i, x'_{(i)}) \prod_{j \ne i} q_j(x'_j) = -\ln\left(\frac{q_i(x_i)}{q_i(0)}\right) + \int dx'_{(i)}\, g_i(0, x'_{(i)}) \prod_{j \ne i} q_j(x'_j),$$

$$\text{i.e., } \forall i, x_i > 0, \quad g_i(x_i, \cdot) \cdot q_{(i)} = -\ln\left(\frac{q_i(x_i)}{q_i(0)}\right) + g_i(0, \cdot) \cdot q_{(i)}. \qquad (9)$$

All the terms on the right-hand side are specified, as well as the $q_{(i)}$ term on
the left-hand side. Any $g_i(x_i, \cdot)$ that obeys the associated equation has the
specified $q$ as a bounded rational equilibrium.

See [27] for discussion of alternative interpretations of this information- 
theoretic formulation of bounded rationality. That reference also discusses 
kinds of prior knowledge that do not result in the Maxent Lagrangian, in 
particular knowledge based on finite data sets (Bayesian inference). A scalar- 
valued quantification of the rationality of a player is also presented there. 


3 Bounded rational versions of best response 

One crude way to try to find the $q$ given by Eq. 4 would be an iterative process
akin to the best-response scheme of game theory [10]. Given any current
distribution $q$, in this scheme all agents $i$ simultaneously replace their current
distributions. In this replacement each agent $i$ replaces $q_i$ with the distribution
given in Eq. 4 based on the current $q_{(i)}$. This scheme is the basis of the
use of Brouwer’s fixed point theorem to prove that a solution to Eq. 4 exists.
Accordingly, it is called parallel Brouwer updating. (This scheme goes by 
many names in the literature, from Boltzmann learning in the RL community 
to block relaxation in the optimization community.) 
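The scheme just described can be sketched for a two-player game with exactly computed conditional expected costs. This is a minimal illustration, not the paper's implementation: the cost table `G`, the rationalities in `beta`, and the function names are hypothetical, and the small rationalities are chosen so that the simultaneous update is a contraction here. In general, parallel updating is not guaranteed to converge.

```python
import math

def conditional_costs(i, g_i, q_other):
    """E(g_i | x_i) for a 2-player game, where g_i[x1][x2] is player i's cost."""
    if i == 0:
        return [sum(g_i[a][b] * q_other[b] for b in range(len(q_other)))
                for a in range(len(g_i))]
    return [sum(g_i[a][b] * q_other[a] for a in range(len(q_other)))
            for b in range(len(g_i[0]))]

def boltzmann(costs, beta):
    """Boltzmann distribution exp(-beta * cost), normalized."""
    m = min(costs)  # subtract the minimum for numerical stability
    w = [math.exp(-beta * (c - m)) for c in costs]
    z = sum(w)
    return [x / z for x in w]

def parallel_brouwer(g, beta, q, iters=200):
    """All players simultaneously adopt the Boltzmann response to the current q."""
    for _ in range(iters):
        new0 = boltzmann(conditional_costs(0, g[0], q[1]), beta[0])
        new1 = boltzmann(conditional_costs(1, g[1], q[0]), beta[1])
        q = [new0, new1]
    return q

# Hypothetical team game: both players share the cost table G[x1][x2].
G = [[0.0, 1.0], [1.0, 0.2]]
beta = [0.5, 0.5]
q_star = parallel_brouwer([G, G], beta, [[0.5, 0.5], [0.5, 0.5]])
q_again = parallel_brouwer([G, G], beta, q_star, iters=1)
```

At convergence one further update leaves `q_star` unchanged, i.e., it satisfies the coupled Boltzmann equations of Eq. 4 for this game.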

Sometimes conditional expected costs for each agent can be calculated explicitly
at each iteration. More generally, they must be estimated. This can
be done via Monte Carlo sampling, iterated across a block of time. During
that block the agents all repeatedly and jointly IID sample their (unchanging)
probability distributions to generate joint moves, and the associated cost values
are recorded. These are then used to estimate all the conditional expected
costs, which then determine the parallel Brouwer update².

This is exactly what is done in RL-based schemes in which each agent 
maintains a data-based estimate of its cost for each of its possible moves, 
and then chooses its actual move stochastically, by sampling a Boltzmann 
distribution of those estimates. (See [25] for ways to get accurate MC estimates
more efficiently than in this simple scheme, e.g., by exploiting the bias-variance
tradeoff of statistics.)
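The simple block-sampling estimate described above can be sketched as follows for a two-player game. The cost table, the strategies, the block length, and the function names are illustrative assumptions; the sketch uses plain sample averages with none of the variance-reduction techniques of [25].

```python
import random

def sample(q, rng):
    """Draw one move from the discrete distribution q."""
    return rng.choices(range(len(q)), weights=q, k=1)[0]

def mc_conditional_costs(g_i, q, i, block, rng):
    """Estimate E(g_i | x_i) from `block` IID joint samples of the fixed q."""
    totals = [0.0] * len(q[i])
    counts = [0] * len(q[i])
    for _ in range(block):
        x = tuple(sample(qj, rng) for qj in q)  # one joint move
        totals[x[i]] += g_i[x[0]][x[1]]
        counts[x[i]] += 1
    # A move never sampled in the block gets no estimate (None).
    return [t / c if c else None for t, c in zip(totals, counts)]

# Hypothetical game and strategies for illustration.
G = [[0.0, 1.0], [1.0, 0.2]]
q = [[0.5, 0.5], [0.3, 0.7]]
estimates = mc_conditional_costs(G, q, 0, 20000, random.Random(0))

# Exact values for comparison: sum over x2 of G[x1][x2] * q2(x2).
exact = [sum(G[a][b] * q[1][b] for b in range(2)) for a in range(2)]
```

With this block length the estimates lie close to the exact conditional expectations; shorter blocks trade accuracy for faster updates.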

One alternative to parallel Brouwer updating is serial Brouwer updating,
where we only update one $q_i$ at a time. This is analogous to a Stackelberg
game, in that one agent makes its move and then the other(s) respond [4, 
6]. In a team game, any serial Brouwer updating must reduce the common 
Lagrangian, in contrast to the case with parallel Brouwer updating. 

There are many versions of serial updating. In cyclic serial Brouwer up- 
dating, one cycles through the i in order. In random serial Brouwer updating, 
one cycles through them in a random fashion. 

In greedy serial Brouwer updating, instead of cycling through all $i$, at each
iteration we choose which single player to update based on the associated drop
in the common Lagrangian. Those drops can be evaluated without calculating
the associated Boltzmann distributions. To see how, use $N_i$ to indicate the
normalization constant of Eq. 4. Then define the Lagrangian gap at $q$ for
player $i$ as $\ln[N_i] + \int dx_i\, q_i(x_i) E_{q_{(i)}}(g_i \mid x_i) + \int dx_i\, q_i(x_i) \ln[q_i(x_i)]$. This is how
much $\mathscr{L}$ is reduced if only $q_i$ undergoes the Brouwer update³.
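The Lagrangian gap can be checked numerically against the actual drop in the common Lagrangian. The sketch below assumes a two-player team game with the rationality absorbed into a hypothetical shared cost table `G`, so the common Lagrangian is $E_q(G) - S(q)$. All function names are illustrative.

```python
import math

G = [[0.0, 1.0], [1.0, 0.2]]  # hypothetical shared cost G[x1][x2], beta absorbed

def cond_cost(i, q):
    """E(G | x_i) under the other player's current strategy."""
    if i == 0:
        return [sum(G[a][b] * q[1][b] for b in range(2)) for a in range(2)]
    return [sum(G[a][b] * q[0][a] for a in range(2)) for b in range(2)]

def lagrangian(q):
    """Common Lagrangian E_q(G) - S(q) for strictly positive q."""
    exp_cost = sum(G[a][b] * q[0][a] * q[1][b] for a in range(2) for b in range(2))
    neg_entropy = sum(p * math.log(p) for qi in q for p in qi)
    return exp_cost + neg_entropy

def gap(i, q):
    """ln[N_i] + sum q_i E(G | x_i) + sum q_i ln q_i."""
    c = cond_cost(i, q)
    log_n = math.log(sum(math.exp(-ci) for ci in c))
    return log_n + sum(p * (ci + math.log(p)) for p, ci in zip(q[i], c))

def brouwer_update(i, q):
    """Replace only q_i by its Boltzmann response."""
    w = [math.exp(-ci) for ci in cond_cost(i, q)]
    new = list(q)
    new[i] = [x / sum(w) for x in w]
    return new

def greedy_step(q):
    """Update the single player whose gap (Lagrangian drop) is largest."""
    i = max(range(len(q)), key=lambda j: gap(j, q))
    return i, brouwer_update(i, q)

q = [[0.9, 0.1], [0.2, 0.8]]
chosen, q_next = greedy_step(q)
```

For each player the value `gap(i, q)` matches `lagrangian(q) - lagrangian(brouwer_update(i, q))`, so the greedy choice needs only the normalization constants and current strategies, not the updated distributions.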

Another obvious variant of these schemes is mixed serial/parallel Brouwer 
updating, in which one subset of the players moves in synchrony, followed by 
another subset, and so on. Such updating in a team game can be viewed as 
a simple model of the organization chart of the players. For example, this is 
the case when the players are a corporation, with G being a common cost 
function based on the corporation’s performance. 


2 Parallel Brouwer updating has minimal memory requirements on the agents. Say
agent $i$ has just made a particular move, getting cost $r$, and that the most recent
previous time it made that move was $T$ iterations ago. Then the new estimated cost
for that move, $E'$, is related to the previous one, $E$, by $E' = \ldots$, where $k$ is
a constant less than 1, and $a$ is initially set to 1, while itself also being updated
according to $a \mathrel{+}= k^T$. So agent $i$ only needs to keep a running tally of $E$, $a$, and $T$
for each of its possible moves to use data-aging, rather than a tally of all historical
time-cost pairs.

3 Proof outline: Write the entropy after the update as a sum of non-$i$ entropies
(which are unchanged by the update) plus $i$’s new entropy. Then expand $i$’s new
entropy. This gives the value of the new Lagrangian as $-\ln[N_i]$. Then do the
subtraction.

Say we observe the functioning of such an organization over time, and view
those observations as Monte Carlo sampling of its behavior. Then we can use
those samples to statistically estimate how best to do serial/parallel Brouwer
updating, for the purpose of minimizing the shared cost function $G$. This can
be viewed as a way to optimize the organization chart coupling the players.


4 Parallel Brouwer with data-aging is Nearest Newton 

This section considers a variant of best-response that is more realistic (more
accurately modeling RL-based computational players that are actually used
in machine learning, and arguably more accurately modeling human players
as well). In this variant the expected cost used by each player to update her
strategy is a decaying average of recent expected costs. This decay reflects
a conservative preference for dampening large changes in strategy.

Such a bias is used (implicitly or otherwise) in most multi-player RL algo- 
rithms. For example, in the COIN framework each agent i collects a data set 
of pairs of what value its private cost function has at timestep t together with 
the move it made then. It then estimates its cost for move $x_i$ as a weighted
average of all the cost values in its data set for that move. The weights are
exponentially decaying functions of how long ago the associated observation
was made. This data-aging is crucial to reflect the non-stationarity of agent
$i$’s environment, i.e., that the other agents are changing their strategies with
time. Arguably, humans use similar modifications to best response. Indeed, in
idealized learning rules like fictitious play, such dampening is crucial.

4.1 The dynamics of Brouwer updating 

Consider a multi-stage game where at the end of iteration $t$, each player $i$
updates her distribution $q_i(\cdot, t)$ to

$$q_i(x_i, t) = \frac{e^{-\Phi_i(x_i, t)}}{\int dx'_i\, e^{-\Phi_i(x'_i, t)}}. \qquad (10)$$


This is a generalization of parallel Brouwer updating, where the function being
exponentiated can be Q values (as in Q-learning [24]), single-instant reward
values, distorted versions of these (e.g., to incorporate data-aging), etc.

As an example, for single-instant rewards (i.e., conventional parallel Brouwer),
$\Phi_i(x_i, t)$ is player $i$’s estimate of ($\beta_i$ times) her conditional expected cost for
taking move $x_i$ at time $t - 1$. If that estimate were exact, this would mean

$$\Phi_i(x_i, t) = \beta E(g_i \mid x_i, q_{(i)}(t - 1)) = \beta \int dx'_{(i)}\, q_{(i)}(x'_{(i)}, t - 1)\, g_i(x_i, x'_{(i)}). \qquad (11)$$

As another example, for Q-learning, one player is Nature and her distribution
is always a delta function. In this case $\Phi_i(x_i, t)$ is the Q-value for player $i$
taking action $x_i$ when the state of Nature is as specified by the associated
delta function in $q(\cdot, t - 1)$.

Note that there’s no Monte Carlo sampling being done here, as there is
in most real-world RL; this is a somewhat abstracted version of such RL.
Alternatively, the analysis here becomes exact when $\Phi_i$ is evaluated in closed
form, or (as when $\Phi_i$ is an empirical expectation value) there are enough samples
in a Monte Carlo block so that empirical averages effectively give us exact
values of expected quantities.

At this point we have to say something about how $\Phi_i$ evolves with time.
Consider the case where $\Phi_i$ is an estimate of some function $\phi_i$ formed by
exponential aging of the previous $\phi_i$ values. In our case (since everything is
evaluated in closed form), assuming there have been an infinite number of
preceding timesteps, this is the same as geometric data-aging:

$$\Phi_i(x_i, t) = \alpha \phi_i(x_i, q(t - 1)) + (1 - \alpha) \Phi_i(x_i, t - 1) \qquad (12)$$

for some appropriate function $\phi_i$⁴. For example, in parallel Brouwer updating,
$\phi_i(x_i, t) = \beta E(g_i \mid x_i, q_{(i)}(t - 1))$, while $\Phi_i(x_i, t)$ is a geometric average of the
previous values of $\phi_i(x_i)$.
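The equivalence between the recursion of Eq. 12 and an explicit exponentially weighted sum can be checked directly. The sketch below treats a single move with a scalar sequence of $\phi_i$ values; the function names and the finite history with an explicit starting value `start` are assumptions made for illustration (the text assumes an infinite history, for which the starting value drops out).

```python
import math

def aged_recursive(phis, alpha, start=0.0):
    """Apply Phi(t) = alpha * phi(t-1) + (1 - alpha) * Phi(t-1) along the sequence."""
    est = start
    for phi in phis:
        est = alpha * phi + (1.0 - alpha) * est
    return est

def aged_explicit(phis, alpha, start=0.0):
    """Same estimate as an exponentially weighted sum with rate gamma = -ln(1 - alpha)."""
    gamma = -math.log(1.0 - alpha)
    n = len(phis)
    total = sum(alpha * math.exp(-gamma * (n - 1 - k)) * phi
                for k, phi in enumerate(phis))
    # The starting value is aged by n steps.
    return total + math.exp(-gamma * n) * start

history = [1.0, 0.5, 2.0, 0.0, 3.0]  # illustrative phi values, oldest first
rec = aged_recursive(history, 0.3, start=1.5)
exp_sum = aged_explicit(history, 0.3, start=1.5)
steady = aged_recursive([2.0] * 200, 0.3)  # constant input: estimate tends to 2.0
```

The two computations agree up to rounding, and for a constant input the aged estimate converges to that constant, which is why only the running value needs to be stored.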


4.2 The continuum-time limit 


To go to the continuum-time limit, let $t$ be a real variable, and replace the
temporal delay value of 1 in Eq. 12 with $\delta$ and $\alpha$ with $\alpha\delta$ (we’ll eventually
take $\delta \to 0$). In addition differentiate Eq. 10 with respect to $t$ to get

$$\frac{dq_i(x_i, t)}{dt} = -q_i(x_i, t) \left[ \frac{d\Phi_i(x_i, t)}{dt} - \int dx'_i\, q_i(x'_i, t) \frac{d\Phi_i(x'_i, t)}{dt} \right]. \qquad (13)$$

In the $\delta \to 0$ limit, assuming $q$ is a continuous function of $t$, Eq. 12 becomes

$$\frac{d\Phi_i(x_i, t)}{dt} = \alpha [\phi_i(x_i, q) - \Phi_i(x_i, t)], \qquad (14)$$

where from now on the $t$ variable is being suppressed for clarity.

If we knew the dynamics of $\phi_i$, we could solve Eq. 14 via integrating factors,
in the usual way. Instead, here we’ll plug that equation for $d\Phi_i/dt$ into Eq. 13.
Then use Eq. 10 to write $\Phi_i(x_i, q) = \text{constant} - \ln(q_i(x_i))$. The result is

$$\frac{dq_i(x_i)}{dt} = -\alpha q_i(x_i) [\phi_i(x_i, q) + \ln(q_i(x_i))] + q_i(x_i) \int dx'_i\, \alpha q_i(x'_i) [\phi_i(x'_i, q) + \ln(q_i(x'_i))]. \qquad (15)$$
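The continuum-time dynamics can be explored numerically with a forward Euler integration. The sketch below reads Eq. 15 as $dq_i(x_i)/dt = -\alpha q_i(x_i)[f_i(x_i) - \sum_{x'_i} q_i(x'_i) f_i(x'_i)]$ with $f_i = \phi_i + \ln q_i$, which is the form obtained by substituting Eq. 14 into Eq. 13, and takes $\phi_i = \beta E(G \mid x_i, q_{(i)})$ for a hypothetical two-player team game. The cost table, step size `DT`, and function names are assumptions for illustration.

```python
import math

G = [[0.0, 1.0], [1.0, 0.2]]  # hypothetical shared cost G[x1][x2]
BETA, ALPHA, DT = 1.0, 1.0, 0.05

def phi(i, q):
    """phi_i(x_i) = beta * E(G | x_i, q_(i)) for the 2-player team game."""
    if i == 0:
        return [BETA * sum(G[a][b] * q[1][b] for b in range(2)) for a in range(2)]
    return [BETA * sum(G[a][b] * q[0][a] for a in range(2)) for b in range(2)]

def velocity(i, q):
    """Right-hand side of the assumed form of Eq. 15 for player i."""
    f = [ph + math.log(p) for ph, p in zip(phi(i, q), q[i])]
    mean_f = sum(p * fx for p, fx in zip(q[i], f))
    return [-ALPHA * p * (fx - mean_f) for p, fx in zip(q[i], f)]

def euler(q, steps):
    """Forward Euler integration, all players advanced simultaneously."""
    for _ in range(steps):
        v = [velocity(i, q) for i in range(2)]
        q = [[p + DT * dv for p, dv in zip(q[i], v[i])] for i in range(2)]
    return q

def boltzmann(costs):
    w = [math.exp(-c) for c in costs]
    return [x / sum(w) for x in w]

q_end = euler([[0.9, 0.1], [0.2, 0.8]], 4000)
fixed = [boltzmann(phi(i, q_end)) for i in range(2)]  # q_i proportional to exp(-phi_i)
```

For this game the integrated strategies settle at the point where each $q_i$ is proportional to $e^{-\phi_i}$, the desired bounded rational equilibrium. Each Euler step conserves normalization up to rounding because the velocities of each player sum to zero.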


4 To see that this is exponential data-aging with exponent $\gamma$, set $\gamma = -\ln(1 - \alpha)$.

4.3 Relation with Nearest Newton descent and replicator dynamics

As mentioned previously, there are many ways to find equilibria, and in par- 
ticular many distributed algorithms for doing so. This is especially so in team 
games, where finding such equilibria reduces to descending a single over- 
arching Lagrangian. One natural idea for descent in such games is to use 
the Newton-Raphson descent algorithm. However that algorithm cannot be 
applied directly to search across q in a distributed fashion, due to the need 
to invert matrices coupling the agents. As an alternative, one can consider 
what new distribution $p$ the Newton algorithm would step to if there were no
restriction that $p$ be a product distribution. One can then ask what product
distribution is closest to $p$, according to Kullback-Leibler distance [9]. It turns
out that one can solve for that optimal product distribution. The associated 
update rule is called the Nearest Newton algorithm [29]. 

It turns out that when one writes down the Nearest Newton update rule,
it says to increment each component $q_i(x_i)$ by the exact quantity appearing
on the right-hand side of Eq. 15, where $\alpha$ is the stepsize of the update, and
$\phi_i(x_i, t) = \beta E(G \mid x_i, q_{(i)}(t))$, as in parallel Brouwer updating for a team
game⁵. In other words, in team games, the continuum limit of having each
player using (bounded rational) best response is identical to the continuum
limit of the Newton-Raphson algorithm for descending the Lagrangian, with
the data-aging parameter $\alpha$ giving the stepsize.

Eq. 15 arises in yet other contexts as well. In particular, say $\Phi_i$ is the
conditional expected reward (i.e., $\Phi_i(x_i, t - 1) = E(g_i \mid x_i, q(\cdot, t - 1))$). Then
the $\beta \to \infty$ limit of Eq. 15 reduces to a simplified form of the replicator
dynamics equation of evolutionary game theory [21, 23]. (If the stepsize $\alpha$ is
an appropriately increasing function of $E(G)$ other versions of that dynamics
arise.) This is because in that limit the ln term disappears, and the right-hand
side of Eq. 15 involves only the difference between player $i$’s expected cost
and the average expected cost of all players. This 3-way connection suggests
using some of the techniques for solving replicator dynamics to expedite either
parallel Brouwer or Nearest Newton.

4.4 Convergence and equilibria 

By Eq. 15, at equilibrium, for each $i$, $q_i(x_i)[\phi_i(x_i, q) + \ln(q_i(x_i))]$ must be
independent of $x_i$. One way this can occur is if it equals 0. However $q_i(x_i)$ can never
be 0, by Eq. 10. This means we have an equilibrium at $q_i(x_i) \propto e^{-\phi_i(x_i, q)}$.
Intuitively, this is exactly what we want, according to Eq. 10 and our
interpretation of $\Phi_i(x_i, q)$ as an estimate of $\phi_i(x_i, q)$. Note also that this solution
means that $\phi_i(x_i, q) = \Phi_i(x_i, q)$, so that (according to Eq. 14) $\Phi_i(x_i, q)$ has
also reached an equilibrium.

5 More generally, Nearest Newton uses this update rule with
$\phi_i(x_i, t) = \beta E(g_i \mid x_i, q_{(i)}(t))$ where each $g_i(x) = G(x) - D(x_{(i)})$ for some function $D$. See [29].

When our equilibrium has $q_i(x_i)[\phi_i(x_i, q) + \ln(q_i(x_i))] = A \ne 0$, we have

$$q_i(x_i) \propto e^{-q_i(x_i) \phi_i(x_i, q)}. \qquad (16)$$

In light of Eq. 10, this means that $\Phi_i(x_i, q) \ne \phi_i(x_i, q)$. So by Eq. 14, $\Phi_i(x_i, q)$
hasn’t reached an equilibrium in this case:

$$\frac{d\Phi_i(x_i, q)}{dt} = \alpha \phi_i(x_i, q)[1 - q_i(x_i)]. \qquad (17)$$

If both $q_i(x_i)$ and $\phi_i(x_i, q)$ were frozen at this point, this solution for $\Phi_i(x_i, q)$
would not obey Eq. 12. So $q_i(x_i)$ and/or $\phi_i(x_i, q)$ cannot be frozen. In
fact, if $\phi_i(x_i, q)$ varies with time, then we know by Eq. 15 that $q_i(x_i)$ varies
as well. So in either case $q_i(x_i)$ must vary, i.e., this equilibrium is not stable.

Although the dynamics has the desired fixed point, it may take a long time
to converge there. There are several ways to analyze that. One is to examine
the second derivatives (with respect to time) of the $q_i$ and/or the $\Phi_i$. Another
is to examine the time-dependence of the residual error,

^ e(x *’ ty ~ ~ (18) 

The next subsection includes a convergence analysis involving residual errors, 
but for a different variant of Brouwer from the ones considered so far. 

4.5 Other variants of Brouwer updating 

Data-aging can be viewed as moving only part-way from the current $\Phi_i$ to
what it should be (i.e., to $\phi_i$). An alternative is to dispense with the $\Phi_i$ and $\phi_i$
altogether, and instead step part-way from the current $q$ to what it should be,
i.e., partially move to the (bounded rational) best response mixed strategy.
Formally, this means replacing Eq. 10 so that the update is not implicit, in
how $\Phi_i(x_i, t)$ depends on the past value of $q(t - 1)$ (Eq. 12), but explicit:

$$q_i(x_i, t) = q_i(x_i, t - 1) + \alpha [h_i(x_i, q_{(i)}(t - 1)) - q_i(x_i, t - 1)] \qquad (19)$$

where $h_i(x_i, q_{(i)}(t))$ is the Boltzmann distribution of what $q_i(x_i, t)$ would be,
under ideal circumstances, and we implicitly have small stepsize $\alpha$.

The only fixed point of this updating rule is where $q_i = h_i\ \forall i$. So just
like with continuum-limit parallel Brouwer, we have the correct equilibrium.
To investigate how fast the update rule of Eq. 19 arrives at that equilibrium,
write its error at time $t$ as the residual

$$\begin{aligned}
r_i(x_i, t) &= q_i(x_i, t) - h_i(x_i, q_{(i)}(t)) \\
&= q_i(x_i, t - 1)[1 - \alpha] + \alpha h_i(x_i, q_{(i)}(t - 1)) - h_i(x_i, q_{(i)}(t)) \\
&= q_i(x_i, t - 1)[1 - \alpha] + \alpha h_i(x_i, q_{(i)}(t - 1)) \\
&\quad - h_i[x_i, q_{(i)}(t - 1) + \alpha[h_{(i)}(q(t - 1)) - q_{(i)}(t - 1)]] \qquad (20)
\end{aligned}$$

where we have assumed that all players other than $i$ are updating themselves
in the same way that $i$ does (i.e., via Eq. 19), and $h_{(i)}(q(t - 1))$ means the
vector of the values of all $h_{j \ne i}(x_j)$ evaluated for $q(t - 1)$.

With obvious notation, rewrite Eq. 20 as

$$\begin{aligned}
r_i(x_i, t) &= q_i(x_i, t - 1)[1 - \alpha] \\
&\quad + \alpha h_i(x_i, q_{(i)}(t - 1)) \\
&\quad - h_i[x_i, q_{(i)}(t - 1) - \alpha r_{(i)}(t - 1)]. \qquad (21)
\end{aligned}$$

Now use the fact that $\alpha$ is small to expand the last $h_i$ term on the right-hand
side to first order in its second (vector-valued) argument, getting the result

$$r_i(x_i, t) \approx r_i(x_i, t - 1)[1 - \alpha] + \alpha \nabla h_i \cdot r_{(i)}(t - 1) \qquad (22)$$

where the gradient of $h_i$ is with respect to the vector components of its second
argument. Accordingly, if $r_i(x_i)$ starts much larger than the other residuals,
it will be pushed down to their values. Conversely, if it starts much smaller
than them, it will rise.
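The behavior of the residuals under the partial-step rule of Eq. 19 can be observed in a small simulation. The sketch below assumes a two-player team game with a hypothetical cost table `G`, a modest rationality `BETA`, and stepsize `ALPHA`; the function names are illustrative, and the run records only the largest residual magnitude rather than the full vector analyzed in the text.

```python
import math

G = [[0.0, 1.0], [1.0, 0.2]]  # hypothetical shared cost G[x1][x2]
BETA, ALPHA = 0.5, 0.1

def h(i, q):
    """Boltzmann best response h_i(., q_(i)) of player i."""
    if i == 0:
        c = [sum(G[a][b] * q[1][b] for b in range(2)) for a in range(2)]
    else:
        c = [sum(G[a][b] * q[0][a] for a in range(2)) for b in range(2)]
    w = [math.exp(-BETA * ci) for ci in c]
    return [x / sum(w) for x in w]

def step(q):
    """Eq. 19 applied to every player simultaneously."""
    targets = [h(i, q) for i in range(2)]
    return [[p + ALPHA * (t - p) for p, t in zip(q[i], targets[i])]
            for i in range(2)]

def max_residual(q):
    """Largest |q_i(x_i) - h_i(x_i, q_(i))| over players and moves."""
    return max(abs(p - t) for i in range(2) for p, t in zip(q[i], h(i, q)))

q = [[0.9, 0.1], [0.2, 0.8]]
history = [max_residual(q)]
for _ in range(400):
    q = step(q)
    history.append(max_residual(q))
```

In this example the largest residual shrinks toward zero, so the iterates approach the fixed point $q_i = h_i$. A larger `BETA` strengthens the coupling through $\nabla h_i$ and can slow or prevent this decay.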

There are other ways one can reduce a stochastic game to a deterministic
continuum-time process. In particular, this can be done in closed form for
fictitious play games and some simple variants of it [19, 10].

Acknowledgements: I would like to thank Stefan Bieniawski, Bill Macready,
George Judge, Chris Henze, and Ilan Kroo for helpful discussion.